PyMongo Basics

MongoDB is one of a number of schemaless NoSQL databases which have become popular for use in big data and in areas where the strict schemas of a SQL database don't always fit. PyMongo is a Python distribution which allows us to work with MongoDB from Python. You need to download and install both PyMongo and MongoDB. Before using the tutorial below you must have a mongod instance running: in the command line, cd to where MongoDB was installed, cd to the bin folder and type mongod. For me, on Windows, this was C:\Program Files\MongoDB\Server\3.0\bin.


In [1]:
from pymongo import MongoClient

client = MongoClient() #connects to the running mongod instance
users = client.test_database.user #database "test_database" and collection "user" are created lazily, on the first insert, if they do not already exist

In [2]:
users.delete_many({}).deleted_count #Making sure that the collection is empty before I start (delete_many() replaces the deprecated remove())


Out[2]:
0

In SQL, databases hold tables, tables contain rows and each row is made up of a number of columns. In a document-oriented NoSQL database such as MongoDB, databases hold collections, collections hold documents and each document is made up of a number of fields. Documents are the NoSQL version of rows and are JSON-like (MongoDB stores them as BSON), so we can write them as Python dictionaries. Above we are using the database "test_database" and the collection "user". This is the collection we will insert our documents into.
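
To make the idea of a document as a dictionary concrete, here is a minimal sketch that uses only the standard library and does not touch MongoDB. The "address" sub-document and its values are made up purely to illustrate nesting.

```python
import json

# A document is just a Python dictionary. Values can be strings, numbers,
# lists, or even other documents (nested dictionaries).
# The "address" field is hypothetical and only here to show nesting.
document = {
    "name": "Bob",
    "location": "Ireland",
    "interests": ["Java", "python"],
    "address": {"city": "Dublin", "country": "Ireland"},
}

# Serialising to JSON and back gives an equal dictionary, which is why
# dictionaries are a natural way to write documents in Python.
as_json = json.dumps(document, sort_keys=True)
round_tripped = json.loads(as_json)

print(as_json)
print(round_tripped == document)
```

Types that JSON lacks, such as ObjectId or dates, are handled by BSON instead, so this round trip is only an analogy for simple documents.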


In [3]:
user = {"name" : "Bob",
       "location" : "Ireland",
       "interests" : ["Java", "python"]} 

users.insert_one(user) #insert user into the users collection as a document (the older insert() also works in PyMongo 3 but is deprecated)


Out[3]:
<pymongo.results.InsertOneResult at 0x374bea0>

In [4]:
users.find() #find() returns a cursor


Out[4]:
<pymongo.cursor.Cursor at 0x374cc18>

In [5]:
users.find_one()


Out[5]:
{u'_id': ObjectId('55f2f7c08b92f7072c1f5c76'),
 u'interests': [u'Java', u'python'],
 u'location': u'Ireland',
 u'name': u'Bob'}

Above we created a document with name, location and interests fields. The fields above take strings or a list of strings, but it is also possible to give them ints, floats, even other documents, and many other types. In the JavaScript mongo shell, typing find() prints the documents in a collection, but in Python it returns a cursor which we iterate over to retrieve the documents. find_one() (findOne() in JavaScript) returns the first document from the collection, or None if there is no match.

Below we add another user and use the cursor to iterate over all the users and print them. This user has fewer fields than our first user: location is not declared. A row like this would not fit a fixed SQL schema without a NULL placeholder, but it is perfectly acceptable in MongoDB.


In [6]:
user = {"name" : "Ted",
       "interests" : ["data science", "R"]} #Note that this document does not contain the same fields (columns) as the one above
users.insert_one(user)


Out[6]:
<pymongo.results.InsertOneResult at 0x374b678>

In [7]:
for user in users.find(): #use the cursor to return all documents in the collection
    print(user) #the order the fields will be printed in can't be guaranteed


{u'interests': [u'Java', u'python'], u'_id': ObjectId('55f2f7c08b92f7072c1f5c76'), u'name': u'Bob', u'location': u'Ireland'}
{u'interests': [u'data science', u'R'], u'_id': ObjectId('55f2f7c68b92f7072c1f5c77'), u'name': u'Ted'}

insert_many() can be used for bulk inserts. Again note that the first user has an extra field not declared by any of the other users.


In [8]:
#Insert many users at once

new_users = [{"name" : "Mike",
              "occupation" : "Data Scientist",
              "interests" : ["data science", "machine learning", "python", "R"]},
             {"name" : "Elliot",
              "interests" : ["programming"]}]
users.insert_many(new_users)


Out[8]:
<pymongo.results.InsertManyResult at 0x374bcf0>

In [9]:
for user in users.find():
    print(user["name"]) #print out just the user's name


Bob
Ted
Mike
Elliot

While we know that each user has included their name, if we try to read a field that is not present in all documents, such as the occupation field, Python raises a KeyError. Code such as the below should be used to prevent the error.


In [10]:
#printing the occupation for all users would raise a KeyError as only one document contains this field
for user in users.find():
    try:
        print(user["occupation"])
    except KeyError: #skip documents that have no occupation field
        pass


Data Scientist
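
An alternative to try/except is dict.get(), which returns a default instead of raising. The sketch below works on any iterable of documents, whether a cursor or a plain list; the helper name occupations and the sample data are illustrative only.

```python
def occupations(documents, default=None):
    """Return (name, occupation) pairs, using `default` when a field is missing."""
    # dict.get() returns `default` instead of raising KeyError.
    return [(doc.get("name"), doc.get("occupation", default)) for doc in documents]


# Stand-in data shaped like the documents in this tutorial.
sample = [
    {"name": "Bob", "location": "Ireland"},
    {"name": "Mike", "occupation": "Data Scientist"},
]
print(occupations(sample, default="unknown"))

# With a live collection the filtering can also be pushed to the server
# using MongoDB's $exists operator, for example:
#   users.find({"occupation": {"$exists": True}})
```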

In [11]:
#find() with a filter still returns a cursor, here over every document whose name is Bob
print(users.find({"name" : "Bob"})) #using find_one() instead would return the first user found whose name is Bob.


<pymongo.cursor.Cursor object at 0x000000000374CCF8>
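
Because find() with a filter hands back a cursor rather than documents, the usual pattern is to iterate over it. The sketch below assumes `collection` is a PyMongo collection such as `users`; the helper name names_with_interest is hypothetical.

```python
def names_with_interest(collection, interest):
    """Return the names of all users whose interests list contains `interest`."""
    # Matching a single value against an array field selects the documents
    # whose array contains that value.
    cursor = collection.find({"interests": interest})
    # Iterating the cursor is what actually retrieves the documents.
    return [doc["name"] for doc in cursor]


# Example call against the collection used in this tutorial:
#   names_with_interest(users, "python")
```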

Bob has gotten a job and wants to update his profile. users.find_one({"name" : "Bob"}) returns the first user whose name is Bob. As with any dictionary, we add an occupation field, set its value to "programmer", and then replace the first matching Bob with the new document. Note that the ObjectId returned below is identical to the ObjectId returned the first time we inserted Bob into the collection. The ObjectId is the unique identifier of a document, and this shows that we have updated the existing document, not made a new one.


In [12]:
update_user = users.find_one({"name" : "Bob"})
update_user["occupation"] = "programmer"
users.replace_one({"name" : "Bob"}, update_user) #replace_one() swaps in the whole document; the older update() is deprecated

print(users.find_one({"name" : "Bob"}))


{u'interests': [u'Java', u'python'], u'location': u'Ireland', u'_id': ObjectId('55f2f7c08b92f7072c1f5c76'), u'name': u'Bob', u'occupation': u'programmer'}
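
Fetching the whole document, editing it and writing it back is not the only way to change one field. A minimal sketch of the same change using update_one() with MongoDB's $set operator follows; `collection` is assumed to be a PyMongo collection such as `users`, and the helper name set_field is hypothetical.

```python
def set_field(collection, name, field, value):
    """Set one field on the first document whose name matches.

    Returns the number of documents the server reports as modified.
    """
    # $set changes only the named field and leaves the rest of the
    # document, including its _id, untouched.
    result = collection.update_one({"name": name}, {"$set": {field: value}})
    return result.modified_count


# Example call against the collection used in this tutorial:
#   set_field(users, "Bob", "occupation", "programmer")
```

This avoids a round trip to read the document first, at the cost of not seeing the other fields before the change.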

It turns out that "Elliot" is actually a bot account made with the purpose of spamming the other users. delete_many({"name" : "Elliot"}) will remove all Elliots in our collection, which is OK as we only have one. For larger collections all removing or updating should be done on the ObjectId. When we use find_one() afterwards nothing is returned, as there is now no user with the name Elliot in the collection.


In [13]:
users.delete_many({"name" : "Elliot"}) #delete_many() replaces the deprecated remove()
users.find_one({"name" : "Elliot"})
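
The advice to delete by ObjectId can be sketched as follows: look the user up once, then delete exactly that document by its _id. `collection` is assumed to be a PyMongo collection such as `users`, and the helper name delete_user_by_name is hypothetical.

```python
def delete_user_by_name(collection, name):
    """Delete the first user with this name, targeting it by _id.

    Returns the number of documents deleted (0 if no such user exists).
    """
    doc = collection.find_one({"name": name})
    if doc is None:
        return 0
    # _id is unique, so this can never remove more than one document.
    result = collection.delete_one({"_id": doc["_id"]})
    return result.deleted_count


# Example call against the collection used in this tutorial:
#   delete_user_by_name(users, "Elliot")
```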

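An example of how to delete a field from a user's document is shown below. We pull Mike's data out of the collection and then pop off the occupation field. This only changes the local dictionary; the collection itself is unchanged until the document is written back.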


In [14]:
update_user = users.find_one({"name" : "Mike"})
update_user.pop("occupation")
print(update_user)


{u'interests': [u'data science', u'machine learning', u'python', u'R'], u'_id': ObjectId('55f2f7ca8b92f7072c1f5c78'), u'name': u'Mike'}
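
Popping the field from the dictionary leaves the stored document untouched until it is written back. As a sketch of a server-side alternative, MongoDB's $unset operator removes a field directly; `collection` is assumed to be a PyMongo collection such as `users`, and the helper name unset_field is hypothetical.

```python
def unset_field(collection, name, field):
    """Remove one field from the first document whose name matches.

    Returns the number of documents the server reports as modified.
    """
    # The value given to $unset is ignored; an empty string is conventional.
    result = collection.update_one({"name": name}, {"$unset": {field: ""}})
    return result.modified_count


# Example call against the collection used in this tutorial:
#   unset_field(users, "Mike", "occupation")
```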

Here we print Mike's ObjectId. print shows it as a 24-character hexadecimal string, but a plain string like this cannot be used to match _id when searching the collection. To search with it we must convert it into an ObjectId by importing ObjectId from bson.objectid (BSON stands for Binary JSON). Below we take Mike's _id, convert it with ObjectId(), search the collection for his document, update his document and then search for the new entry, all using his unique ObjectId.


In [15]:
print(update_user["_id"])


55f2f7ca8b92f7072c1f5c78

In [16]:
from bson.objectid import ObjectId

id_as_string = str(update_user["_id"]) #str() gives the 24-character hex form of the ObjectId
print(users.find_one({"_id" : ObjectId(id_as_string)}))

users.replace_one({"_id" : ObjectId(id_as_string)}, update_user) #writes back the document without the occupation field
print("")
print(users.find_one({"_id" : ObjectId(id_as_string)}))


{u'interests': [u'data science', u'machine learning', u'python', u'R'], u'_id': ObjectId('55f2f7ca8b92f7072c1f5c78'), u'name': u'Mike', u'occupation': u'Data Scientist'}

{u'interests': [u'data science', u'machine learning', u'python', u'R'], u'_id': ObjectId('55f2f7ca8b92f7072c1f5c78'), u'name': u'Mike'}
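
ObjectId() only accepts a string of exactly 24 hexadecimal characters and raises bson.errors.InvalidId for anything else, so it can be worth checking a string before converting it. The sketch below shows the expected format using only the standard library. The helper looks_like_object_id is hypothetical; bson itself provides ObjectId.is_valid() for the same purpose.

```python
import re

# 12 bytes written as hexadecimal gives exactly 24 characters.
_OBJECT_ID_HEX = re.compile(r"[0-9a-fA-F]{24}\Z")


def looks_like_object_id(text):
    """Return True if `text` has the 24-character hex form of an ObjectId."""
    return _OBJECT_ID_HEX.match(text) is not None


print(looks_like_object_id("55f2f7ca8b92f7072c1f5c78"))  # Mike's _id from above
print(looks_like_object_id("Mike"))
```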

PyMongo allows us to leverage other great Python modules such as pandas. Here we import pandas, convert the cursor returned by find() into a list and then turn that into a new DataFrame. All missing values are replaced with NaN, and we can use pandas' fillna() to replace these with suitable values.


In [17]:
import pandas as pd

df = pd.DataFrame(list(users.find()))
print(df.head())


                        _id                                    interests  \
0  55f2f7c68b92f7072c1f5c77                            [data science, R]   
1  55f2f7ca8b92f7072c1f5c78  [data science, machine learning, python, R]   
2  55f2f7c08b92f7072c1f5c76                               [Java, python]   

  location  name  occupation  
0      NaN   Ted         NaN  
1      NaN  Mike         NaN  
2  Ireland   Bob  programmer  

In [19]:
df = df.fillna("None")
print(df)


                        _id                                    interests  \
0  55f2f7c68b92f7072c1f5c77                            [data science, R]   
1  55f2f7ca8b92f7072c1f5c78  [data science, machine learning, python, R]   
2  55f2f7c08b92f7072c1f5c76                               [Java, python]   

  location  name  occupation  
0     None   Ted        None  
1     None  Mike        None  
2  Ireland   Bob  programmer  
